Search CORE

27 research outputs found

Demystifying Map Space Exploration for NPUs

Author: Kao Sheng-Chun
Krishna Tushar
Parashar Angshuman
Tsai Po-An
Publication venue
Publication date: 07/10/2022
Field of study

Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable. However, while there are dozens of mappers out there (all empirically claiming to find better mappings than others), the research community lacks systematic insights on how different search techniques navigate the map-space and how different mapping axes contribute to the accelerator's performance and efficiency. Such insights are crucial to developing mapping frameworks for emerging DNNs that are increasingly irregular (due to neural architecture search) and sparse, making the corresponding map spaces much more complex. In this work, rather than proposing yet another mapper, we do a first-of-its-kind apples-to-apples comparison of search techniques leveraged by different mappers. Next, we extract the learnings from our study and propose two new techniques that can augment existing mappers -- warm-start and sparsity-aware -- that demonstrate speedups, scalability, and robustness across diverse DNN models

arXiv.org e-Print Archive

LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic [Extended Version]

Author: Adler Michael
Emer Joel
Fleming Kermin E.
Parashar Angshuman
Pellauer Michael
Publication venue
Publication date: 23/11/2010
Field of study

CORRECTION: The authors for entry [4] in the references should have been "E. S. Chung, J. C. Hoe, and K. Mai".Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model

DSpace@MIT

eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

Author: Agustsson Eirikur
Buckler Mark
Chen Qifeng
Deng J.
Han Song
He Kaiming
Hegde Kartik
Huang J.-B.
Ignatov Andrey
International Organization for Standardization.
Johnson Justin
Kim Jiwon
Krizhevsky Alex
Martin D.
Nah Seungjun
Parashar Angshuman
Simonyan Karen
Xygkis Athanasios
Zeyde R.
Zhu Jun-Yan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/10/2019
Field of study

Convolutional neural networks (CNNs) have recently demonstrated superior quality for computational imaging applications. Therefore, they have great potential to revolutionize the image pipelines on cameras and displays. However, it is difficult for conventional CNN accelerators to support ultra-high-resolution videos at the edge due to their considerable DRAM bandwidth and power consumption. Therefore, finding a further memory- and computation-efficient microarchitecture is crucial to speed up this coming revolution. In this paper, we approach this goal by considering the inference flow, network model, instruction set, and processor design jointly to optimize hardware performance and image quality. We apply a block-based inference flow which can eliminate all the DRAM bandwidth for feature maps and accordingly propose a hardware-oriented network model, ERNet, to optimize image quality based on hardware constraints. Then we devise a coarse-grained instruction set architecture, FBISA, to support power-hungry convolution by massive parallelism. Finally,we implement an embedded processor---eCNN---which accommodates to ERNet and FBISA with a flexible processing architecture. Layout results show that it can support high-quality ERNets for super-resolution and denoising at up to 4K Ultra-HD 30 fps while using only DDR-400 and consuming 6.94W on average. By comparison, the state-of-the-art Diffy uses dual-channel DDR3-2133 and consumes 54.3W to support lower-quality VDSR at Full HD 30 fps. Lastly, we will also present application examples of high-performance style transfer and object recognition to demonstrate the flexibility of eCNN.Comment: 14 pages; appearing in IEEE/ACM International Symposium on Microarchitecture (MICRO), 201

arXiv.org e-Print Archive

Crossref

SlicK: Slice-based Locality Exploitation . . .

Author: Angshuman Parashar
Sudhanva Gurumurthi et al.
Publication venue
Publication date
Field of study

Transient faults are expected a be a major design consideration in future microprocessors. Recent proposals for transient fault detection in processor cores have revolved around the idea of redundant threading, which involves redundant execution of a program across multiple execution contexts. This paper presents a new approach to redundant threading by bringing together the concepts of slice-level execution and value and control-flow locality into a novel partial redundant threading mechanism called SlicK. The purpose of redundant execution is to check the integrity of the outputs propagating out of the core (typically through stores). SlicK implements redundancy at the granularity of backward-slices of these output instructions and exploits value and control-flow locality to avoid redundantly executing slices that lead to predictable outputs, thereby avoiding redundant execution of a significant fraction of instructions while maintaining extremely low vulnerabilitie

CiteSeerX

Mechanisms for Bounding Vulnerabilities of Processor Structures ABSTRACT

Author: Anand Sivasubramaniam
Angshuman Parashar
Niranjan Soundararajan
Publication venue
Publication date
Field of study

Concern for the increasing susceptibility of processor structures to transient errors has led to several recent research efforts that propose architectural techniques to enhance reliability. However, real systems are typically required to satisfy hard reliability budgets, and barring expensive full-redundancy approaches, none of the proposed solutions treat any reliability budgets or bounds as hard constraints. Meeting vulnerability bounds requires monitoring vulnerabilities of processor structures and taking appropriate actions whenever these bounds are violated. This mandates treating reliability as a first-order microarchitecture design constraint, while optimizing performance as long as reliability requirements are satisfied. This paper makes three key contributions towards this goal: (i) we present a simple infrastructure to monitor and provide upper bounds on the vulnerabilities of key processor structures at cyclelevel fidelity; (ii) we propose two distinct control mechanisms – throttling and selective redundancy – to proactively and/or reactively bound the vulnerabilities to any limit specified by the system designer; (iii) within this framework, we propose a novel adaptation of Out-of-Order Commit for vulnerability reduction, which automatically provides additional leverage for the control mechanisms to boost performance while remaining within the reliability budget

CiteSeerX

An uncalibrated lightfield acquisition system

Author: Angshuman Parashar
Prem Kalra
Puneet Sharma
Subhashis Banerjee
Publication venue
Publication date: 01/01/2002
Field of study

Acquisition of image data for lightfield usually requires an expensive, complex and bulky setup. In this paper, we describe a simple method of acquiring the image data set. The method requires a normal handheld video camera, which is taken around the object to be rendered. We employ homography from the viewing/camera plane to the lightfield plane for obtaining the ray intersections with the lightfield planes. The computations involved are simple and make the method suitable for online lightfield acquisition

CiteSeerX